Bayesian regression facilitates quantitative modelling of cell metabolism
This paper presents Maud, a command-line application that implements Bayesian statistical inference for kinetic models of biochemical metabolic reaction networks. Maud takes into account quantitative information from omics experiments and background knowledge, as well as structural information about kinetic mechanisms, regulatory interactions and enzyme knockouts. Our paper reviews the existing options in this area, presents a case study illustrating how Maud can be used to analyse a metabolic network and explains the biological, statistical and computational design decisions underpinning Maud.
1 Introduction
A kinetic model of cellular metabolism aims to express what is known about a cellular process in the form of an in silico representation of the underlying network of chemical reactions. Kinetic models can be used to improve production of target molecules, determine regulatory networks (Christodoulou et al. 2018) and identify potential drug targets (DeBerardinis and Chandel 2016; Liberti et al. 2017). However, the use of kinetic models in practice is hindered by their dependence on noisy and uncertain information sources. Quantitative in vivo measurements of chemical abundances, and in vitro measurements relating to kinetic parameters, both contain vital information but are notoriously inaccurate (REFERENCES). Practically useful kinetic modelling therefore requires a principled statistical approach that encompasses multiple possible model parameterisations.
Bayesian statistical inference can combine the structural information implicit in kinetic models with knowledge about metabolic parameters and information from omics measurements (Saa and Nielsen 2016; Gopalakrishnan, Dash, and Maranas 2020). However, kinetic models pose serious computational challenges for Bayesian inference (Gutenkunst et al. 2007; A. Raue et al. 2010).
The scope of a kinetic model is defined by a stoichiometric matrix, \(S\), in which rows represent metabolites, columns represent reactions, and matrix elements \(s_{ij}\) represent the stoichiometric coefficient of metabolite \(i\) in reaction \(j\). The change in metabolite concentrations is:
\[\begin{equation}\label{eq-1} \frac{dC}{dt} = S\cdot v - \mu\cdot C \end{equation}\]
In equation \(\eqref{eq-1}\), C represents a vector of metabolite concentrations, \(v\) is a vector of reaction rates, and \(\mu\) is the growth rate. The second term represents the dilution due to cell growth.
In a kinetic model, the rates, \(v\), are expressed as a function of the enzyme concentrations, \(E\), the metabolite concentrations, \(C\), and a set of parameters, \(\theta\) as shown in equation \(\eqref{eq-2}\)
\[\begin{equation}\label{eq-2} v = f(C, E, \theta) \end{equation}\]
The parameters must include sufficient boundary concentrations and fluxes to solve equation \(\eqref{eq-1}\).
It is common to assume pseudo-steady state for metabolites, i.e., the rate of fluxes towards any metabolite is much greater than the rate of change in concentration, \(𝑣 \gg \frac{𝑑𝐶}{𝑑𝑡}\). Moreover, the dilution effect is assumed minimal, \(\mu\cdot C \ll \vec{v}\) (true unless the concentration is very high). Finally, the enzyme concentration is assumed constant for the period considered and hence part of the parameters.
Given these assumptions and a set of values for , a set of steady state metabolite concentrations and fluxes can be found by solving for \(C\) the algebraic equation:
\[\begin{equation}\label{eq-3} S\cdot f(C;\theta) = 0 \end{equation}\]
In a fermentation context, equation \(\eqref{eq-3}\) captures the rapid kinetics inside the cell, while another set of ODEs would be used to describe the external substrate and product concentrations, which could act as boundary parameters to equation \(\eqref{eq-3}\).
In the context of kinetic modelling, Bayesian inference is appealing because it allows uncertainty to be represented appropriately without sacrificing mechanistic accuracy. Measurement uncertainty can naturally be represented in a Bayesian measurement model, whereas the prior model can represent quantitative uncertainty about kinetic parameters. Finally, kinetic rate laws can be represented in Bayesian data generation models with arbitrarily high fidelity. See Gelman, Carlin, et al. (2020) for more about Bayesian inference and Gelman, Vehtari, et al. (2020) for a discussion of practical Bayesian workflow.
Another advantage is that Bayesian inference problems are well-posed even when not all parameters are strongly identified. Sloppy models in which measurable quantities are sensitive to combinations of parameters but not to individual marginal parameter values are ubiquitous in models of biological systems (Gutenkunst et al. 2007; White et al. 2016). The parameter correlation structure represents the set of potential models that describe the observed data. As we demonstrate in Section 2.2.2 below, capturing this correlation structure is difficult outside a fully Bayesian context.
Previous Bayesian kinetic models have either sacrificed mechanistic accuracy or have attempted to fit realistic kinetic models using obsolete or unreliable computational methods.
The most popular algorithm for fitting Bayesian statistical models is Markov Chain Monte Carlo (MCMC). Modern MCMC algorithms allow exploration of high- dimensional posterior distributions, have robust failure diagnostics (Vehtari et al. 2021) and can incorporate fast numerical solvers, thereby making inference feasible for Bayesian kinetic models. Nonetheless, the kinetic modelling literature reports an aversion to MCMC, rooted mainly in concerns about sampling time and the presumed difficulty of implementing the required statistical model (Andreas Raue et al. 2013; Saa and Nielsen 2016). We are only aware of two recent attempts to implement a Bayesian kinetic modelling approach using MCMC. Stapor et al. (2018) fitted detailed kinetic models using relatively inefficient MCMC algorithms that do not scale well to high dimensional parameter spaces limiting the scope of modelling. Conversely, St. John et al. (2018) utilises an efficient sampling algorithm but uses approximate kinetics, namely lin-log kinetics Visser and Heijnen (2003), limiting the scope of interpreting parameters and inferring cellular behaviour in experimental conditions outside the reference dataset.
There have also been efforts to implement Bayesian inference for kinetic models without the use of MCMC. Examples of alternative inference methods include variational inference (St. John et al. 2018), rejection sampling and approximate Bayesian computation (Saa and Nielsen 2016) and Laplace approximation, in which the Fisher information matrix is used to calculate a normal approximation around the maximum a posteriori parameter configuration (Liebermeister and Noor 2021; Gopalakrishnan, Dash, and Maranas 2020; Stapor et al. 2018; Andreas Raue et al. 2013). Non-MCMC-based Bayesian kinetic models have limited utility because they lack reliable diagnostic tools for verifying that their results approximate the target posterior distribution. This is a problem because realistic kinetic models tend to induce highly correlated, non-Gaussian, joint probability distributions (Gutenkunst et al. 2007; Stapor et al. 2018).
Our application Maud provides the first Bayesian kinetic model to combine biologically realistic mechanistic accuracy—including accurate rate laws, post- translational modification and thermodynamics—with fast, robust MCMC sampling using adaptive Hamiltonian Monte Carlo. Further, Maud is a general-purpose application that can be used to fit a wide range of Bayesian kinetic models.
2 Results and Discussion
To demonstrate our application’s capabilities, we used Maud to analyse an artificial dataset based on the human methionine cycle. We generated this dataset using Maud, by simulating measurements for a set of training and validation conditions based on plausible parameter values. Next, we performed posterior inference for the training measurements and prediction for both training and validation measurements.
We investigated Maud’s sensitivity to missing measurements by comparing the results of fitting a full dataset with an intentionally incomplete dataset. To demonstrate why a full Bayesian approach is preferable to an approach based on a Laplace approximation of the posterior distribution, we compared the results of analysing a representative metabolic network using both methods.
Finally, we dug into our results to find out what our full dataset methionine model learned about the contributions of different regulatory factors to the flux through GNMT, an important reaction. This analysis illustrates how Maud can be used to generate actionable insights about metabolism without the need for further statistical analysis.
The methionine cycle, illustrated in Figure 1, is a fundamental pathway in human metabolism, whose intermediate metabolites participate in a variety of mechanisms which must compete for the same resources. Due to this competition, as well as the fact that all the functions occur simultaneously, the methionine cycle is highly regulated, with 6 known allosteric effectors (Martinov et al. 2000; Nijhout et al. 2006; Korendyaseva et al. 2008). This complex regulation means that quantitative modelling of the methionine cycle requires a detailed kinetic model: this is why we chose it as a case study for Maud.
2.1 Dataset and model specification
The simulated dataset and underlying kinetic model that we used for our analysis can be found at https://github.com/biosustain/Methionine_model/tree/main/data/methionine and is described in supporting information section 3.
We constructed a kinetic model of the methionine cycle in Maud’s format using the description in Korendyaseva et al. (2008). The ordinary differential equation system describing this model is shown in Equation \(\eqref{eq-meth-ode}\).
\[\begin{align} \frac{d[met]}{dt} &= v_{Influx} - v_{PROT} - v_{MAT} +v_{MS} + v_{BHMT} \label{eq-meth-ode} \\ \frac{d[amet]}{dt} &= v_{MAT} - v_{GNMT} - v_{METH} \nonumber \\ \frac{d[ahyc]}{dt} &= v_{GNMT} + v_{METH} - v_{AHC} \nonumber \\ \frac{d[hyc-L]}{dt} &= v_{AHC} - v_{CBS} - v_{MS} - v_{BHMT} \nonumber \\ \frac{d[5mthf]}{dt} &= v_{MTHFR} - v_{MS} \nonumber \end{align}\]
After specifying the qualitative aspects of the kinetic model, we selected parameter values to use as ground truth by Monte Carlo sampling using a previous model of the methionine cycle as a starting point (see Saa and Nielsen (2016) for this model).
We used these parameters to simulate steady states in a range of plausible experimental conditions, again using Saa and Nielsen (2016) as a starting point. These steady states were then used to generate simulated measurements using the measurement model described below in Section 2.1.1.
2.1.1 Measurement model
Generating an artificial dataset required a specification of the true measurement error. For enzyme and metabolite concentration measurements we specified a standard deviation of 0.1 on natural logarithmic scale, corresponding to approximately 10% measurement error. For each reaction measurement a measurement standard deviation of approximately 10% of the simulated value.
These measurement error specifications are somewhat optimistic considering the many sources of variation and uncertainty affecting quantitative proteomics, metabolomics and fluxomics analyses, but are a reasonable first approximation to a realistic set of measurements.
For our main model run, we assumed that all metabolite and enzyme concentrations were measured, and that there was a reaction measurement for each of the network’s elementary flux modes.
2.1.2 Training/validation split
The split between training and validation conditions was selected to achieve a large difference in fluxes between the two categories. The split was determined as we are interested in showing how our model can fit to varied conditions, and conditions closer to the training set are likely to be predicted well without necessarily learning the system.
2.1.3 Additional dataset with missing measurements
To gain insight into our model’s robustness to missing measurements, we also performed a model run with the same 6 experimental datasets, but with measurements of the metabolite S-Adenosyl-L-homocysteine, or “ahcys” removed. Since ahcys regulates three enzymes in the methionine cycle, including one enzyme which is also thermodynamically regulated, we expected the removal of these measurements to yield interesting results.
2.1.4 Maud input specification
We constructed inputs in Maud’s format for each of the analysed datasets, based on the scenario that the true kinetic model was known except for parameter values, which needed to be inferred from the training data and priors. These inputs can be found at https://github.com/biosustain/Methionine_model/tree/main/data.
The prior distributions and corresponding true parameter values used in our case study are shown in supporting information section 3.2. They were chosen to reflect a plausible pre-experimental information state. In 7 cases, the marginal prior distribution for a parameter disagrees with the true parameter values used to generate the data; a similar situation is likely to occur in practice due to in vivo vs in vitro measurement differences.
2.2 Findings
2.2.1 Posterior inference
Running standard diagnostic checks indicated that the samples we generated were from the target posterior distribution. The improved \(\hat{R}\) statistic (Vehtari et al. 2021) for every variable of interest was within 2% of 1, indicating appropriate mixing within and between Markov chains. Additionally, the number of effective samples was high, indicating that we generated enough posterior samples to support inferences about the bulks of the distributions of the sampled parameters. Furthermore, we observed no post warm-up divergent transitions, indicating that the sampler was able to transform the log-posterior distribution, avoiding any regions with excessive curvature that might inhibit exploration via HMC.
Posterior predictive checking indicated that our model achieved a good fit to the simulated reaction and metabolite concentration measurements, as shown by the graphs in the top row of Figure 2.
Analysis of the posterior distributions for kinetic parameters indicated that these are highly correlated. The marginal posterior distributions for most kinetic parameters did not shrink significantly compared with the corresponding marginal prior distributions, even though these parameters’ joint posterior distribution contained enough information to make accurate out of sample predictions. In some cases, there were two-dimensional correlations such as the one shown in the bottom left of Figure 2; in this case the marginal distribution of the two parameters is roughly banana shaped. More commonly, however, two-dimensional pair plots were insufficient to reveal the underlying correlation structure, as seen in the bottom-right plot in figure Figure 2. This does not mean that the parameters were uncorrelated, but rather that the correlations involve more than two parameters.
These results show that Maud can achieve an adequate fit to a realistic pathway-sized dataset. This was achieved without fixing the marginal values of kinetic parameters: the information required to make good predictions was contained in the correlation structure of the joint posterior distribution. This finding is consistent with previous analyses of biological systems that found they are “sloppy”, that is, sensitive to parameter combinations rather than marginal parameter values, with important combinations, scales and regions of sensitivity being difficult to ascertain in advance (Gutenkunst et al. 2007; Poirier 1998).
The question naturally arises whether the crucial high-dimensional parameter correlations are linear or non-linear. This is relevant to the question of model performance, as linear correlations are easier to correct for. A linearly correlated posterior space would also be easier to summarise. We address this question below in Section 2.2.2.
This case study illustrates the type of kinetic model and dataset that Maud can fit. The model we analysed has 10 reactions, 5 state variables and 212 parameters. Generalising from our ability to fit this model in a reasonable time using Maud, we expect that Maud can be used to fit realistic Bayesian models of approximately the same size, but not, for example, genome-scale kinetic models. To fit larger models, faster steady state solving methods or alternative inference algorithms will be required. Section 2.2.2 addresses whether Laplace approximation is a suitable candidate.
2.2.2 Comparison with Laplace approximation
We found that the Laplace method was not able to produce an accurate posterior approximation for our model and dataset.
Our overall strategy was to compare samples generated using MCMC, which we were confident could be treated as draws from the true posterior distribution, with approximate posterior samples generated using the Laplace method. We were unable to generate approximate posterior samples for our main methionine cycle case study using Laplace approximation, as the algorithm could not recover from solver failures caused by unrealistic parameter configurations. We therefore made a comparison for a simpler model. This input can be found at https://github.com/biosustain/Methionine_model/tree/main/data/example_ode.
MCMC sampling for our comparison input yielded 800 samples in 625 minutes; Laplace sampling yielded the same number of samples in only one minute. The diagnostics indicated that our algorithm was able to find the maximum a posteriori parameter configuration, approximate the Hessian and use these quantities to generate approximate posterior samples. The results can be found at https://github.com/biosustain/Methionine_model/tree/main/results.
Figure 3 summarises the results of comparing the samples generated using each method. As can be seen from the top left plot, the Laplace method does not provide a good approximation to the true posterior distribution in this case, as the marginal distribution of the total log probability density is clearly different. This impression was confirmed using the Kolmogorov-Smirnov test, which is a test to differentiate two empirical univariate distributions. The two distributions were significantly different with a p-value indistinguishable from zero.
The difference between the Laplace approximation output and the true posterior distribution manifests not only in the parameter space, but also in the measurement space. The upper right (training data) and lower left (validation data) plots in Figure 3 show 5%-95% marginal flux measurement log likelihood intervals for both methods; lower log likelihood values indicate that the modelled and measured values are further away. The graph shows that the Laplace approximation yielded significantly worse predictions than the true posterior, even for the training data.
To further explore why this is the case we compared samples from the true posterior and the Laplace approximation for the pairwise marginal distributions of two Michaelis-Menten constants \(K_{m}^{A,r1}\) and \(K_{cat}^{r1}\): see the bottom right cell of Figure 3. This comparison demonstrates that the Laplace method is not able to capture the correct relationships between parameters’ distributions.
This result shows that MCMC, while slower than Laplace approximation, is unfortunately preferable for posterior inference in this case. We expect that the Laplace method will produce worse approximations the more complex the target model. Since the approximation is already unacceptable for our simple test model, we recommend that Maud users use MCMC sampling in preference to Laplace approximation if possible when fitting realistic Bayesian kinetic models.
Our results here also provide circumstantial evidence that the parameter correlations in Bayesian kinetic model posteriors tend to be non-linear, as a posterior with only linear correlations would likely be more germane to Laplace approximation. A conclusion that we drew from this analysis was that the results of fitting our model cannot be summarised simply, for example by fitting a multivariate normal distribution to the posterior draws. We therefore recommend that Maud users store the full set of MCMC draws rather than using such an approximation. This does not preclude the possibility that there is an alternative, more compact, way to summarise the results of Bayesian kinetic model inference; we leave research into this topic to future work.
2.2.3 Effect of missing metabolite concentration measurements
Comparing model runs with and without the ahcys measurements showed that Maud can produce sensible results even from incomplete metabolomics data.
Figure 4 shows that, as might be expected, the model with missing measurements did not correctly infer the missing ahcys concentrations. Nonetheless, the remaining measured metabolites were still well predicted, suggesting that information about the network is still preserved despite the missing measurements. Comparison of flux measurements in both models also indicated that removing the ahcys measurement did not result in catastrophic model failure.
The missing measurements did affect Maud’s ability to infer parameter values correctly. The lower left plot of Figure 4 shows that the model with full metabolomics learned the true value for the displayed dissociation constant, despite this value being far from the mean of the corresponding marginal prior distribution. In contrast, the model with missing measurements stayed in the neighbourhood of the prior.
This result is reassuring because not having access to all measurements is a common situation in multi-omics studies. For instance, measuring all metabolites in a pathway can be infeasible because of limitations of mass spectrometers, availability of standards, column effects, and compartmentalisation. However, provided that sufficient information is available from other sources, our approach can produce sensible results from incomplete metabolomics data.
2.2.4 Application to regulatory understanding
To demonstrate how Maud’s output can be used to yield useful metabolic insights we used the results of our case study to explain why the flux of the enzyme \(GNMT\) is higher in dataset 1 than in dataset 12. GNMT is an irreversible enzyme that is homotropically activated by its substrate, competitively inhibited by its product and heterotropically inhibited by 5,10- methylenetetrahydrofolate (mlthf). The diverse regulation makes it the ideal test case to elucidate regulatory changes.
Figure 5 shows the regulatory description of GNMT, according to the results our main case study analysis. Each curve shows the marginal posterior distribution of the ratio of the corresponding regulatory component in dataset 1 compared with dataset 12. A positive value indicates that the component was increased in dataset 1 relative to dataset 12, with 0 indicating no difference. The probability, according to our model, of the component acting in each direction is given by the relative area under the curve on each side of the 0 point.
Our model correctly inferred that saturation and allosteric effects were the main drivers of regulation between the two datasets in this case, with the curves for each component aligning with the ground truth shown in red.
Importantly, this form of analysis takes into account all modelled sources of uncertainty, including uncertainty about the measured values of the flux in each dataset. Our result shows that Maud could be used in this realistic case not only to provide a user with an explanation for an observed difference in fluxes, but also a reasonable judgement as to the explanation’s robustness.
3 Methods
Maud is a command line application implementing Bayesian inference for a wide range of realistic kinetic models. Maud is written in Python (Van Rossum and Drake 2009), designed for use on Windows, macOS and Linux, is pip installable from the Python Package Index as maud-metabolic-models, documented at https://maud-metabolic-models.readthedocs.io and actively developed and maintained at https://github.com/biosustain/Maud/.
To use Maud, a user must first collate appropriate input information, represent it in files with Maud’s required formats (see supporting information section on input format). Maud’s command line interface provides commands for inference via a range of algorithms including adaptive Hamiltonian Monte Carlo, as well as commands for simulation and making out-of-sample predictions. Results are stored in files, using a structured, interoperable format.
As well as parameter values, Maud also performs inference for derived quantities including the components of the regulatory decomposition described below in equation \(\eqref{eq-decomposition}\), log likelihoods, simulated measurement values and metabolic control analysis coefficients (Kacser and Burns 1973).
In the rest of this section we describe Maud’s kinetic model at high level and discuss Maud’s statistical model, implementation and how it solves the crucial steady state problem. Further details about Maud’s kinetic model can be found in the supporting information.
3.1 Kinetic model
Maud’s kinetic model decomposes into factors as shown in equation \(\eqref{eq-decomposition}\).
\[\begin{equation} F(C;\theta) = Enzyme\cdot k_{cat}\cdot Reversibility \cdot Saturation \cdot Allostery \label{eq-decomposition} \end{equation}\]
Each of the terms on the right-hand side of \(\eqref{eq-decomposition}\) is a function of \(C\) and \(\theta\). This idea is taken from Noor et al. (2013). The terms usefully gather physically meaningful and conceptually distinct factors contributing to reaction fluxes. \(Enzyme\) captures the effect of enzyme concentration, \(k_{cat}\) that of enzyme efficiency, \(Reversibility\) quantifies thermodynamic effects, \(Saturation\) the effect of enzyme availability and \(Allostery\) the effect of post translational modifications.
We used the model of enzyme saturation from Liebermeister, Uhlendorf, and Klipp (2010) and the generalised Monod-Wyman-Changeux model of Allosteric regulation introduced in (Monod, Wyman, and Changeux 1965; Changeux 2013; Popova and Sel’kov 1975, 1979) and used more recently in Matos et al. (2022). To capture the effect on enzyme activity of coupled phosphorylation and dephosphorylation processes we developed a new mathematical model inspired by the generalised MWC model of allosteric regulation. Full details of all mathematical aspects of Maud’s kinetic model can be found in supporting information section 2.
3.2 Statistical model
Maud represents information from measurements using generalised linear regression models that probabilistically connect realised measurements with true values of the measurable quantities. Information from other sources is represented using a prior distribution over a set of latent parameters. The parameters and the measurable quantities are connected by a generative model encompassing Maud’s kinetic model as well as the steady state equation \(\eqref{eq-3}\). Together, the measurement model, prior model and generative model determine a joint probability function that assigns a probability density to any possible combination of measurements and parameters.
Below we describe the prior and measurement models, as the generative model has already been discussed above.
3.2.1 Prior model
Maud’s prior model includes unknown parameters corresponding to quantities in the kinetic model that are assumed to be unknown, other than steady state metabolite concentrations and fluxes, which are derived from the values of other parameters by solving the steady state problem. See table (REFERENCE) in this paper’s supporting information for a description of all these parameters and their dimensions. Note that some quantities in Maud’s kinetic model are not treated as parameters: for example, temperatures, compartment volumes and the formation energy of water. Maud treats these quantities as if they were known precisely: they can be configured by the user or default values can be used. Although in practice there can be considerable uncertainty regarding these quantities, we chose to disregard this uncertainty in the interest of simplicity.
Except for metabolites’ standard condition Gibbs energy changes of formation, Maud uses independent normal prior distributions for parameters that can in principle be both negative and positive. For parameters that are constrained to be positive, Maud uses independent log-normal distributions. Formation energy parameters have a multivariate normal prior distribution. Location, scale and covariance parameters for all these prior distributions can be selected freely by the user.
3.2.2 Measurement model
Maud’s measurement model considers three types of measurement: metabolite concentration measurements, enzyme concentration measurements and flux measurements, represented by vectors \(𝑦^{𝑐𝑜𝑛𝑐}\) \(𝑦^{𝑒𝑛𝑧}\) and \(𝑦^{𝑓𝑙𝑢𝑥}\) respectively.
All measurements are specific to an experimental condition; that is, a case where the true state of the network, including knockouts, boundary conditions and state variables as well as kinetic and thermodynamic parameters, can safely be assumed to be the same. Maud’s statistical model allows for arbitrarily many experimental conditions, and for any measurable quantity to be measured any number of times in any condition.
Metabolite and enzyme measurements are intended to represent the results of quantitative metabolomics and proteomics experiments. The likelihood functions for such measurements are shown in equations Equation \(\eqref{eq-yconc}\) and Equation \(\eqref{eq-yenz}\).
\[\begin{align} y_i^{conc} &\sim LN(\ln{\hat{y}_i^{conc}}, \sigma_i^{conc})\label{eq-yconc} \\ y_i^{enz} &\sim LN(\ln{\hat{y}_i^{enz}}, \sigma_i^{enz})\label{eq-yenz} \end{align}\]
Both equations are log-normal generalised linear models with a standard link function (the natural logarithm \(\ln\)) and known standard deviation \(\sigma_{𝑐𝑜𝑛𝑐}\). The use of this measurement model is motivated by the consideration that concentrations are constrained to be non-negative, so the measurement model should avoid assigning positive probability mass to negative metabolite concentration values. In addition, we expect the precision of most metabolomics and proteomics experiments to be roughly proportional to the value of the true measured quantity, which supports a measurement model with constant coefficient of variation. The measurement standard deviations \(\sigma_{𝑐𝑜𝑛𝑐}\) and \(\sigma_{𝑒𝑛𝑧}\) are assumed to be known exactly for simplicity; plausible values can be elicited by considering the likely coefficient of variation of the measuring apparatus.
Flux measurements, representing the results of quantitative fluxomics analyses, are modelled using a likelihood function from a standard linear regression model, as shown in equation \(\eqref{eq-yflux}\). Flux measurements can be obtained from the results of isotope labelling experiments using metabolic flux analysis, for example as described in (Young 2014). When entering flux measurements, it is important only to specify measurements for a network’s free fluxes, as the values of some steady state fluxes in a metabolic network are constrained by others, with the result that dependent fluxes cannot typically be measured separately. If measurements of multiple dependent fluxes are entered, information will inappropriately be double counted.
\[\begin{equation} y_i^{flux} \sim LN(\ln{\hat{y}_i^{flux}}, \sigma_i^{flux})\label{eq-yflux} \end{equation}\]
Our measurement model improves on analyses of metabolomics and proteomics data that assume a regression model with normally distributed errors, whether explicitly using a standard linear model or implicitly using ordinary least squares fitting. This assumption is undesirable because it implies that the measured quantity could in principle be negative, and assumes an additive underlying random process, whereas multiplicative processes tend to better describe real concentration data.
The use of independent measurement models for metabolite, enzyme and flux measurements carries an implicit assumption that there are no systematic correlations in the measurement errors. This choice was motivated by simplicity - it would be better to use a model with potentially correlated measurements. Similarly, it would be preferable to include measurement errors as model parameters, thereby avoiding possible bias due to incorrect assessments of measurement accuracy. However, we chose to use a simpler measurement model to avoid the complexity and potential fitting issues that these changes would entail.
Finally, the reader may wonder why Maud uses a linear regression model for reaction flux measurements even though this creates the potential for erroneous double counting and requires non-trivial upstream modelling, as intracellular fluxes typically cannot be measured directly. Instead, fluxes are typically inferred from isotope labelling experiments using metabolic flux analysis: see Dai and Locasale (2017) for more about this method. Ideally Maud’s measurement model for fluxes would extend from fluxes to the results of potential labelling experiments, thereby removing the need for upstream analysis and avoiding any double counting. This option has not yet been pursued, again for the sake of simplicity.
3.3 Implementation
Maud uses the Python library click (Click Developers 2022) to implement a command line interface. The command line interface loads input files as Python dictionaries, which are parsed using the Python library toml (Pearson 2020) and then validated and converted into structured MaudInput objects using Pydantic (Pydantic developers 2022). Maud’s statistical model is implemented in the probabilistic programming language Stan (Carpenter et al. 2017) and accessed using the interface cmdstanpy (Stan Development Team 2022). For posterior sampling, Maud uses the MaudInput to create an input file for Stan and then trigger posterior sampling using adaptive Hamiltonian Monte Carlo. See Betancourt (2018) for more about this algorithm.
When sampling is complete, Maud converts to the output into the standard format InferenceData using the Python library arviz (Kumar et al. 2019) and saves it as a json file, along with some information for debugging. This format allows for easy checking of MCMC diagnostics including divergent transitions, effective sample size and the improved \(\hat{R}\) statistic proposed in Vehtari et al. (2021).
3.4 Solving the steady state problem
Maud’s parameters are connected with measurable quantities via the steady state equation \(\eqref{eq-steady}\). Posterior sampling using adaptive Hamiltonian Monte Carlo requires repeatedly solving equation \(\eqref{eq-steady}\) and finding the gradients of its solution with respect to all model parameters. This must be done numerically as analytic solutions are not available for realistic kinetic models.
The speed at which this problem can be solved is tightly coupled with the size and complexity of metabolic network that can practically be modelled. See Timonen et al. (2022) for more about considerations involved in this kind of modelling.
To solve equation \(\eqref{eq-steady}\) and find its gradients, Maud uses a hybrid method involving two numerical solvers from the SUNDIALS suite (Serban and Hindmarsh 2005): CVODES and IDAS via their interface from Stan. The hybrid method follows that proposed by Margossian (2018), and involves numerically evolving the ODE system for a short period of time, then using the difference between the evolved and starting concentrations as the target for a numerical algebra solver.
4 Supporting information
The analyses described in this paper, as well as instructions for reproducing our results, figures and paper, can be found at https://biosustain/Methionine_model.
The document supporting_information.pdf contains the following information:
- Description of Maud’s input format
- Description of Maud’s kinetic model including rate equations as well as parameters and their dimensions.
- Details of the procedure used to generate our case study results, including generating the artificial case study datasets, specifying priors and carrying out computation.
- Specification of the prior distribution used for the case studies.
5 Acknowledgements
This research was funded by the Novo Nordisk foundation using grant number NNF14OC0009473.